feat(optimizer): [0/N] Optimizer Data Model #527
mkuchenbecker wants to merge 6 commits into linkedin:main
Conversation
Introduces the optimizer service module with:
- MySQL/H2 schema for table_operations, table_stats, table_stats_history, and table_operations_history
- JPA entities with JSON column support (vladmihalcea hibernate-types)
- All model/DTO/enum types: OperationType, OperationStatus, TableStats, CompleteOperationRequest, JobResult, OperationMetrics, etc.
- JPA AttributeConverters for JobResult and OperationMetrics JSON columns
- MapStruct mapper (OptimizerMapper) for entity→DTO conversion
- Spring Boot application shell and build wiring (settings.gradle, build.gradle dockerPrereqs)

No repositories, controllers, or service layer yet; those follow in subsequent PRs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove OperationMetrics class and converter; stats are read directly from table_stats instead of being duplicated into operations
- Remove orphanFilesDeleted/orphanBytesDeleted from the history entity, DTO, and schema; operation-specific data belongs in the result JSON
- Add addedSizeBytes to CommitDelta for tracking write volume
- Fix OperationType javadoc to describe the current state, not the roadmap
- Fix TableOperationsHistoryRow javadoc: written on operation complete, not by the Spark app directly
- Add field comments to all DTOs and request objects

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These fields never belonged in the data model; remove them at the source rather than adding them now and deleting them in a later PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/** Terminal states for a completed Spark maintenance job. */
public enum OperationHistoryStatus {
  SUCCESS,
We should keep the existing statuses such as CANCELED, QUEUED, etc. These are valid statuses, since jobs sometimes cannot be submitted due to GGW/Yarn issues.
Happy to add these, but it's unclear how they would be populated given the current lifecycle. My general position is to add states as they are actually used.
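For concreteness, here is a sketch of what the widened enum might look like if those states are adopted later. QUEUED and CANCELED come from the review comment above, FAILED is assumed as the usual terminal counterpart to SUCCESS; none of this beyond SUCCESS is in the PR as written.

```java
/**
 * Hypothetical sketch of OperationHistoryStatus with the extra lifecycle
 * states suggested in review. Only SUCCESS appears in the PR diff above.
 */
public enum OperationHistoryStatus {
  QUEUED,   // hypothetical: accepted but never submitted (e.g. GGW/Yarn outage)
  SUCCESS,  // terminal: Spark maintenance job finished cleanly
  FAILED,   // assumed terminal counterpart to SUCCESS
  CANCELED; // hypothetical: withdrawn before or during execution

  /** True for states that end the operation's lifecycle. */
  public boolean isTerminal() {
    return this != QUEUED;
  }
}
```

A helper like isTerminal() keeps callers from hard-coding the state list once non-terminal states exist.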
private String jobId;

/** Reserved for future per-operation metadata; currently unused. */
private String metrics;
Can we have a class instead to capture more info? Or do we plan to capture a JSON string here?
Claude: Removed the metrics field from the DTO, entity, and schema since it is unused. A `-- TODO` comment in the schema notes that per-operation metric columns will be added as operations are onboarded.
/** Same UUID as the originating {@code table_operations.id}. Set by the caller; not generated. */
@Id
@Column(name = "id", nullable = false, length = 36)
private String id;
Looks like this UUID is generated as part of job submission?
Claude: Yes, it is a UUID, but it is not generated at submission. It is the same UUID as the originating table_operations.id, set by the Analyzer when the recommendation row is first created. The history row reuses that UUID when the complete endpoint is called, so each cycle is traceable end-to-end by a single id.
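That id flow can be sketched in plain Java. The record and method names below are illustrative stand-ins for the PR's actual entities, not code from the PR:

```java
import java.util.UUID;

/** Illustrative sketch of the single-UUID lifecycle described above. */
public class OperationIdFlow {
  /** Stand-in for a table_operations row created by the Analyzer. */
  record TableOperation(String id) {}

  /** Stand-in for the table_operations_history row written on completion. */
  record TableOperationHistory(String id) {}

  /** The Analyzer mints the UUID once, when the recommendation row is created. */
  static TableOperation createRecommendation() {
    return new TableOperation(UUID.randomUUID().toString());
  }

  /** The complete endpoint reuses the operation's id; nothing is regenerated. */
  static TableOperationHistory complete(TableOperation op) {
    return new TableOperationHistory(op.id());
  }
}
```

Because both rows share one id, a single lookup by UUID traces a cycle end-to-end, and the 36-character canonical UUID string fits the VARCHAR(36) columns.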
private String tableUuid;

@Column(name = "database_name", nullable = false, length = 255)
private String databaseName;
This seems to be 128 chars long in the current prod schema.
Claude: Updated to VARCHAR(128) to align with the existing prod schema. Applied across TableOperationsRow, TableOperationsHistoryRow, TableStatsRow, TableStatsHistoryRow, and the SQL schema.
@Column(name = "database_name", nullable = false, length = 255)
private String databaseName;

@Column(name = "table_name", nullable = false, length = 255)
The table name is also 128 chars long, but yeah, we can double-check.
Claude: Updated to VARCHAR(128) for consistency.
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
@Column(name = "id", nullable = false)
private Long id;
Is this an auto-increment id or a primary key?
Claude: Changed to a UUID (VARCHAR(36)) for consistency with TableOperationsRow and TableOperationsHistoryRow. It is the primary key, set by the caller; no DB-side auto-allocation.
@Column(name = "table_uuid", nullable = false, length = 36)
private String tableUuid;

@Column(name = "database_id", nullable = false, length = 255)
Can we use only database_name, for consistency?
Claude: Renamed databaseId → databaseName (column database_id → database_name) in both TableStatsRow and TableStatsHistoryRow, matching the operations entities.
-- Optimizer Service Schema
-- Compatible with MySQL (production) and H2 in MySQL mode (tests).
CREATE TABLE IF NOT EXISTS table_operations (
  id VARCHAR(36) NOT NULL,
Can we consider adding indexes for these tables too?
Plan is to add indexes when query patterns require them — current PKs cover the access paths used by the analyzer and scheduler today. All key columns are VARCHAR/TIMESTAMP, so secondary indexes can be added cheaply once a real query path needs one.
/** When the operation completed, as recorded by the complete endpoint. */
@Column(name = "submitted_at", nullable = false)
private Instant submittedAt;
Should this be completionTime instead?
Claude: Renamed submittedAt → completedAt (column submitted_at → completed_at, index idx_submitted_at → idx_completed_at). The history row is written when the complete endpoint is called, so the timestamp captures completion. Submission time is already on table_operations.scheduled_at.
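The split between submission and completion timestamps can be sketched as follows. The records are illustrative stand-ins; the field names follow the commit message, assuming nothing beyond it:

```java
import java.time.Instant;

/** Sketch: submission time lives on the operation row, completion time on the history row. */
public class OperationTimestamps {
  /** Stand-in for table_operations; scheduledAt records when the job was scheduled. */
  record TableOperation(String id, Instant scheduledAt) {}

  /** Stand-in for table_operations_history; completedAt is set by the complete endpoint. */
  record TableOperationHistory(String id, Instant completedAt) {}

  /** The complete endpoint stamps the history row with completion time, not submission time. */
  static TableOperationHistory complete(TableOperation op, Instant completedAt) {
    return new TableOperationHistory(op.id(), completedAt);
  }
}
```

Keeping each timestamp on exactly one table avoids duplicating state: job duration is recoverable by joining the two rows on the shared id.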
@Builder(toBuilder = true)
@NoArgsConstructor
@AllArgsConstructor
public static class CommitDelta {
Does this also require @JsonIgnoreProperties? It could provide forward compatibility or a safeguard during upgrades in case new fields are added.
Claude: Added @JsonIgnoreProperties(ignoreUnknown = true) to CommitDelta for consistency with TableStats and SnapshotMetrics.
- Widen-to-tighten: VARCHAR(255) -> VARCHAR(128) for database_name and table_name across all entities and the schema, aligning with prod conventions (can always be widened later, not tightened).
- Rename databaseId -> databaseName in TableStatsRow, TableStatsHistoryRow, TableStatsDto, TableStatsHistoryDto, and UpsertTableStatsRequest for consistency with the operations entities and DTOs.
- Drop the unused metrics field from TableOperationsRow, TableOperationsDto, and the schema. Add a TODO note in the schema that per-operation metric columns will be added as operations are onboarded.
- Rename submittedAt -> completedAt in TableOperationsHistoryRow, TableOperationsHistoryDto, and the schema (column submitted_at -> completed_at, index idx_submitted_at -> idx_completed_at). The history row is written when the complete endpoint is called, so the timestamp captures completion; submission time is already on table_operations.scheduled_at.
- Change TableStatsHistoryRow.id from BIGINT auto-increment to VARCHAR(36) UUID, set by the caller, matching the other id-bearing entities.
- Add @JsonIgnoreProperties(ignoreUnknown = true) to CommitDelta for consistency with TableStats and SnapshotMetrics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le_name)

Add a composite secondary index on (database_name, table_name) to table_operations_history at the schema and entity layers. This backs a new name-based history-lookup endpoint added on optimizer-2; without the index, the query degrades to a full scan on a table that grows with every operation completion. The other three optimizer tables get no new indexes, since there are no new query patterns on them this round.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Optimizer Stack
Summary
PR 0 of N in the optimizer stack.
Overall Project
Service Design doc.
Introduces the optimizer service module's MySQL data model.
Changes
Testing Done
This PR contains only the data model (entities, DTOs, converters). Repository tests follow in PR 1. Verified:
- ./gradlew :services:optimizer:compileJava passes
- ./gradlew compileJava (full project) passes with no regressions

Additional Information